========================================================
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
There are 1599 observations of 13 variables. Most quality ratings sit close to the median of 6. For variables such as free.sulfur.dioxide, total.sulfur.dioxide and residual.sugar, the maximum is far above the mean, which points to long right tails or outliers.
Let us take a first look at some variables by plotting them below.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
Fixed acidity has a long-tailed distribution. The log transform does not reveal anything new, but it makes the distribution look closer to normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Similar to fixed acidity, volatile acidity also has a long-tailed distribution. However, in its log10 plots the distribution looks a little bimodal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
Residual sugar has a very long-tailed distribution with many outliers. Some of these outliers are more than 9 standard deviations away from the median! It will be interesting to see how these outliers affect the quality of wine. In the log10 plots the values are still very skewed, but the distribution looks closer to normal.
In the third plot, I removed the top five percent of data points to get a better view of the core of the data.
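The two steps described above (measuring how far values sit from the median, and trimming the top five percent) can be sketched as follows. This is only an illustration on the first ten `residual.sugar` values shown in the `str()` output, not the full dataset; the use of `quantile()` with a `<=` cutoff is an assumption about how the trimming was done.

```r
# First ten residual.sugar values, copied from the str() output above.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Distance of each value from the median, in units of standard deviation.
sd_from_median <- (residual_sugar - median(residual_sugar)) / sd(residual_sugar)

# Assumed trimming rule: keep values at or below the 95th percentile.
cutoff <- quantile(residual_sugar, 0.95)
core_sugar <- residual_sugar[residual_sugar <= cutoff]

print(round(sd_from_median, 2))
print(summary(core_sugar))
```

On the full data the same expression would be applied to the whole `residual.sugar` column.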
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Chlorides have a distribution similar to residual sugar, with a strong concentration around the median. The box plot also shows a lot of outliers. In the second plot, the top two percent of data points were removed to help understand the distribution of points around the median.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
It is interesting to note that free sulfur dioxide has a bimodal distribution when we take the log10 transform. The data are also well spread out compared to the other features we have seen so far.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
Total sulfur dioxide is similar in some ways to free sulfur dioxide. I would argue that its points are not quite as dispersed, as there are fewer outliers and its interquartile range does not look quite as large. It also has a long-tailed distribution, but in its log10 plot the points look roughly normally distributed.
I created a new variable for bound sulfur dioxide (total sulfur dioxide - free sulfur dioxide) to see if it has any interesting pattern. Let’s see the comparison of all three.
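As a rough sketch, the derived sulfur dioxide variables could be computed as below. The sample uses the first ten values from the `str()` output; the column names `nonfree.sulfur.dioxide` and `pfree.sulfur.dioxide` follow the names quoted later in this report, and defining the free share as free divided by total is an assumption.

```r
# First ten values of each column, copied from the str() output above.
wine <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
)

# Bound sulfur dioxide: total minus free.
wine$nonfree.sulfur.dioxide <- wine$total.sulfur.dioxide - wine$free.sulfur.dioxide

# Share of free sulfur dioxide (assumed definition: free / total).
wine$pfree.sulfur.dioxide <- wine$free.sulfur.dioxide / wine$total.sulfur.dioxide

print(wine)
```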
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
Density has a very normal looking distribution with most of the values falling between 0.995 and 1. For comparison, water has a density of 1, so most of our wine is less dense than water. There are very few outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
Another normal-looking distribution, with most of the pH values falling between 3.1 and 3.5. Much like with density, there are very few outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Sulphates has a longer tail than density and pH, but it still looks roughly normally distributed, as most of the values are clustered around 0.6. An interesting point about sulphates is that some of its outliers are very far from the median. It will be interesting to see how that affects the quality of wine. In its log10 plots, sulphates looks much more normally distributed, although some outliers remain despite the transformation.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
Alcohol has a long-tailed distribution with only a few outliers. The log10 plots do not reveal many new insights: the distribution is still long-tailed and looks much like the original plots. Most wines have less than 11% alcohol, which matches my own experience, as I have rarely picked up a wine with more than 11% alcohol content.
Quality is on a 0-10 scale, and since most ratings are 5 or 6, most of the wines we will look at in the analysis are average wines. It will be interesting to try to find what can make a wine very good or very bad, and to see if there is much correlation between the variables.
The dataset is a tidy one and it has 1599 observations with 13 variables for each one. All of the variables are numeric. The first one, “X”, is an index. The “quality” variable has only 6 discrete values: 3, 4, 5, 6, 7, 8.
Quality is the main feature of interest in the dataset. It would be interesting to see which features contribute most to the quality of the wine.
I expect alcohol, pH, residual sugar, and total acidity to contribute most to the quality of the wine. A little research on red wine suggests that people enjoy a red wine that is neither tart, nor sweet, nor dry, but smooth and wet. It would be interesting to see the composition of different features for the good quality (7 or 8) wines in our dataset.
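I created three new variables: bound sulfur dioxide (“nonfree.sulfur.dioxide”), the percentage of free sulfur dioxide (“pfree.sulfur.dioxide”), and a quality class (bad, regular or good).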
I noted that most of the wines were rated either a 5 or a 6. This could make it more difficult to determine what makes a good wine, as there is less data about good wines. Having more data about lesser wines would also have been useful to provide a better contrast between bad and good wines.
Most of the observations have an alcohol value between 9 and 12, with a median of 10.2. It is strange that wines with a quality of 5 tend to have less alcohol.
Regarding the new variables, bound sulfur dioxide (“nonfree.sulfur.dioxide”) tends to have larger values than free sulfur dioxide. The percentage of free sulfur dioxide (“pfree.sulfur.dioxide”) has an almost normal distribution, with a mean around 0.4.
For some of the features, I removed the top few percent of data points when looking at an additional plot. This was to have a better view of the core of the data, i.e. the interquartile range and how it is distributed.
As mentioned above, I categorized the “quality” feature into bad, regular and good to make the plots in the later analysis easier to read.
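A minimal sketch of this categorization is shown below, using the class boundaries given later in this report (bad: 4 or lower, regular: 5 or 6, good: 7 or higher). The variable name `quality_class` is hypothetical, and the sample is just the first ten `quality` values from the `str()` output.

```r
# First ten quality ratings, copied from the str() output above.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# cut() uses right-closed intervals by default:
# (-Inf, 4] -> bad, (4, 6] -> regular, (6, Inf) -> good.
quality_class <- cut(quality,
                     breaks = c(-Inf, 4, 6, Inf),
                     labels = c("bad", "regular", "good"))

# Count of wines per class; in this small sample no wine is rated "bad".
print(table(quality_class))
```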
It seems that bad wines have higher volatile acidity, and they don’t have high citric acid values. They also tend to have lower sulphate values. Good wines tend to have more alcohol.
Interesting! We find that most of our initial guesses about which factors are related to wine quality were right. We can see that alcohol and sulphates are positively correlated with quality and volatile acidity is negatively correlated with it. My assumption that sugar would be an important factor for wine quality seems to be incorrect based on what the plot shows.
Let’s plot alcohol and quality.
As per our observation, good quality wines have higher levels of alcohol.
Again, per our observation, we note that higher quality wines have higher levels of sulphates.
This plot shows that high quality wines have lower volatile acidity.
I explored the impact of residual sugar on quality because my gut feeling was that it might affect the taste and quality of the wine. But the plots show that residual sugar has no visible impact on the quality of the wine.
Good quality wines have higher levels of citric acid.
There is a jump in the alcohol variable between qualities 5 and 6. Maybe this marks a separation between potentially bad wines and potentially good wines.
Based on our previous analysis, we have been checking some correlations. We were able to explore some tendencies across quality for “volatile.acidity”, “citric.acid”, “sulphates” and “alcohol”. All of these except “volatile.acidity” are positively correlated with quality. This is expected, because “volatile.acidity” is the concentration of acetic acid in wine, which at too high a concentration can lead to a sharp vinegar taste. For wines with a “quality” of 5 the values of “alcohol” are very spread out, although the tendency is that good wines (quality 7 or 8) have the highest median level of alcohol.
Furthermore, correlation matrices have given us a global overview of all pairwise relations, both numerically and graphically.
I was surprised to see that pH and volatile acidity are positively correlated, since a higher pH value means less acidity, but a higher volatile acidity means more acidity.
As expected, citric acid, fixed acidity, and pH are all rather correlated, given that they all measure acidity.
Lastly, I was wrong in assuming that residual sugar may have a significant impact on the quality of the wine. In fact, it hardly contributes to quality.
The strongest relationship, ignoring that between “total.sulfur.dioxide” and “bound.sulfur.dioxide”, is the negative correlation (-0.68) between “fixed.acidity” and “pH”. Of course the correlation is negative for “pH” because a low pH indicates a very acidic environment.
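A correlation matrix like the one discussed above can be sketched with base R's `cor()`. The sample below holds only the first ten rows of five columns from the `str()` output, so the coefficients it prints will not match the full-data values quoted in the text; using Pearson correlation is an assumption.

```r
# First ten rows of selected columns, copied from the str() output above.
wine_sample <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pairwise Pearson correlations between all columns (assumed method).
cor_matrix <- cor(wine_sample, method = "pearson")

print(round(cor_matrix, 2))
```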
For this part, we will focus on our 4 main features we explored in the earlier plots and come up with a predictor model for quality.
For prediction purposes, we have two main problems: 1) the quality feature is unbalanced (too many regular wines), and 2) the regular wines are spread widely across feature values, so they overlap with the bad and good classes. It may be better to predict good (or bad) wines rather than to classify into all three classes.
Let’s compare only bad wines against good wines. In this case, we also add some 2D density maps in order to see where the clusters or groups are located for each combination of features:
Selecting only the “good” and “bad” wines helps us focus on the trends more specifically. Good wines have medium values of citric acid and low values of volatile acidity. Bad wines, on the other hand, have medium-high volatile acidity and low citric acid. This is similar for combinations of “volatile.acidity” with “sulphates” or “alcohol”: good wines are upper left and bad wines are lower right. This tendency is similar in “alcohol” vs “citric.acid” or “sulphates”, although in this case good wines are on the upper right and bad wines on the lower left. For the combination “citric.acid” vs “sulphates”, we can see a more or less horizontal line separating good and bad wines:
Now, finally, I will build a simple linear model using our four main features (alcohol, volatile.acidity, sulphates and citric.acid).
##
## Calls:
## m1: lm(formula = I(quality ~ alcohol), data = red_wine)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = red_wine)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + sulphates,
## data = red_wine)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## citric.acid, data = red_wine)
##
## ================================================================
## m1 m2 m3 m4
## ----------------------------------------------------------------
## (Intercept) 1.875*** 3.095*** 2.611*** 2.646***
## (0.175) (0.184) (0.196) (0.201)
## alcohol 0.361*** 0.314*** 0.309*** 0.309***
## (0.017) (0.016) (0.016) (0.016)
## volatile.acidity -1.384*** -1.221*** -1.265***
## (0.095) (0.097) (0.113)
## sulphates 0.679*** 0.696***
## (0.101) (0.103)
## citric.acid -0.079
## (0.104)
## ----------------------------------------------------------------
## R-squared 0.2 0.3 0.3 0.3
## adj. R-squared 0.2 0.3 0.3 0.3
## sigma 0.7 0.7 0.7 0.7
## F 468.3 370.4 268.9 201.8
## p 0.0 0.0 0.0 0.0
## Log-likelihood -1721.1 -1621.8 -1599.4 -1599.1
## Deviance 805.9 711.8 692.1 691.9
## AIC 3448.1 3251.6 3208.8 3210.2
## BIC 3464.2 3273.1 3235.7 3242.4
## N 1599 1599 1599 1599
## ================================================================
We can see that adding “sulphates” gives a small improvement, but “citric.acid” does not improve the model (we saw this in our plots): its coefficient is not significant and the AIC rises from 3208.8 to 3210.2. The model is not a very good one, as the R2 value is low (0.3 for model 3). Let’s check the accuracy of the model:
## [1] "Successful prediction by quality (0-10)"
## [1] 0.5822389
## [1] "Successful prediction by class"
## [1] 0.833646
## [1] "Let's see an example to predict the wine quality(alcohol= 11, volatile.acidity = 0.6 , sulphates= 0.7)"
## [1] "The predicted wine quality is "
## [1] 6
If we use rounded predicted quality values then we predict correctly 58% of the qualities. But if we use quality classes (bad, regular and good), then we increase the success rate to 83%.
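The accuracy check above can be illustrated with the rounded m3 coefficients from the regression table. This is a sketch under assumptions: that the prediction example uses model m3, that the class boundaries are bad (4 or lower), regular (5 or 6) and good (7 or higher), and that the helper names `predict_m3` and `to_class` are hypothetical. It is applied only to the first ten wines from the `str()` output, so its rates differ from the full-data figures of 58% and 83%.

```r
# Coefficients of model m3, copied from the table above (rounded to 3 decimals).
m3_coef <- c(intercept = 2.611, alcohol = 0.309,
             volatile.acidity = -1.221, sulphates = 0.679)

# Hypothetical helper: linear prediction from the m3 coefficients.
predict_m3 <- function(alcohol, volatile.acidity, sulphates) {
  unname(m3_coef["intercept"] +
           m3_coef["alcohol"] * alcohol +
           m3_coef["volatile.acidity"] * volatile.acidity +
           m3_coef["sulphates"] * sulphates)
}

# Hypothetical helper: map a quality score to its class.
to_class <- function(q) {
  cut(q, breaks = c(-Inf, 4, 6, Inf), labels = c("bad", "regular", "good"))
}

# The example from the output above: 2.611 + 3.399 - 0.7326 + 0.4753 = 5.7527.
example <- predict_m3(alcohol = 11, volatile.acidity = 0.6, sulphates = 0.7)
print(round(example))  # rounds to 6, as in the output above

# First ten wines, copied from the str() output.
alcohol          <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
volatile_acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
sulphates        <- c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8)
quality          <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

predicted <- round(predict_m3(alcohol, volatile_acidity, sulphates))

# Share of wines whose rounded prediction equals the actual rating.
exact_rate <- mean(predicted == quality)
# Share of wines whose predicted class equals the actual class.
class_rate <- mean(to_class(predicted) == to_class(quality))

print(c(exact = exact_rate, class = class_rate))
```

The class-level rate is never lower than the exact rate, because every exact match is also a class match.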
We compared our four features (volatile.acidity, citric.acid, sulphates and alcohol) in pair plots, taking into account the different classes of wine. The regular wines had a large spread; there is no clear boundary between a bad and a regular wine, or between a regular and a good wine. On the other hand, bad wines and good wines are more distinguishable from one another, as we saw in the plot.
We noted that most of the good wines have medium values of citric acid and low values of volatile acidity. Bad wines usually have medium-high volatile acidity and low citric acid. This is similar for combinations of “volatile.acidity” with “sulphates” or “alcohol”: good wines are upper left and bad wines are lower right. This tendency is similar in “alcohol” vs “citric.acid” or “sulphates”, although in this case good wines are on the upper right and bad wines on the lower left. For the combination “citric.acid” vs “sulphates”, we can see a more or less horizontal line separating good and bad wines.
Yes, based on the bivariate plots, it seems that there is a positive correlation between “citric.acid” and “quality”. But if we look at the scatter plots by class of wine (only good and bad), we do not see a clear cutoff in the “citric.acid” feature that distinguishes good and bad wines.
I created simple linear models using our four main features. The first model includes only “alcohol” as a predictor. The next model adds “volatile.acidity”. Model 3 also adds “sulphates”, and the last model adds “citric.acid”. The R2 values of our models are not very good, and the success rates could be a little misleading. One of the main problems is that we have a very unbalanced dataset (too many “regular” wines). Maybe the biggest problem for the model is to distinguish between bad and regular wines, and between good and regular wines.
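The nested models described above can be sketched with `lm()` and `update()`. The formulas follow the calls printed with the regression table; the data here are only the first ten rows from the `str()` output, so the fitted coefficients are illustrative and will not match the table. The summary data frame `comparison` is a hypothetical convenience, not the table format used in the report.

```r
# First ten rows of the relevant columns, copied from the str() output above.
wine_sample <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Start with alcohol only, then add one predictor at a time.
m1 <- lm(quality ~ alcohol, data = wine_sample)
m2 <- update(m1, . ~ . + volatile.acidity)
m3 <- update(m2, . ~ . + sulphates)
m4 <- update(m3, . ~ . + citric.acid)

models <- list(m1 = m1, m2 = m2, m3 = m3, m4 = m4)

# R-squared never decreases as predictors are added; AIC penalizes the extra terms.
comparison <- data.frame(
  model     = names(models),
  r.squared = sapply(models, function(m) summary(m)$r.squared),
  AIC       = sapply(models, AIC)
)

print(comparison)
```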
This plot shows the densities for the distributions of all features in the dataset. We created three quality classes for wine: bad (4 or lower quality values) in red, regular (5 or 6) in green and good (7 or higher) in blue, and grouped the features by class. Variables with less overlap between their density curves could help us to distinguish between quality classes. Four of the best features for this purpose are volatile acidity, citric acid, sulphates and alcohol. Other variables could also help us to detect a specific class, like fixed acidity (good wines) and the percentage of free sulfur dioxide (regular wines).
Note: text, values and ticks of the Y-axis were removed for clarity.
Alcohol by volume and volatile acidity were the two chemical properties most closely related to quality in red wine. Alcohol had a positive relationship with quality, perhaps due to a higher concentration of flavor in wines with higher alcohol percentages. Volatile acidity had a negative relationship with quality rating, due to the fact that higher concentrations can lead to undesirable vinegar-like flavors. As evidenced by the two distinct regions in the plot, the lowest quality wines tended to have lower alcohol percentages and higher volatile acidity concentrations, while the higher quality wines had higher alcohol percentages and lower volatile acidity concentrations, in general.
In this plot we show the pairwise comparison for the six combinations of the four main features. Each combination is represented in a scatter plot. We used a subset of the wines dataset, selecting only wines with quality class bad or good. We also deleted some outliers (volatile acidity >= 1.5, citric acid >= 1 and sulphates >= 2). The idea is to show that these features could help to distinguish good wines from bad wines. We are omitting regular wines because their features are so spread out that it is not easy to make a distinction; besides, a person is usually not interested in detecting a regular wine, but wants to detect a potentially good wine or to avoid a bad one.
These scatter plots also show density 2D maps for each class. This allows us to see regions or clusters of good wine and bad wine.
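The subsetting behind this plot can be sketched as follows: keep only the bad and good classes and drop the outliers listed above. The column name `quality_class` is hypothetical, and the sample is the first ten rows from the `str()` output, which happens to contain no bad wines, so only good wines survive the filter here.

```r
# First ten rows of the relevant columns, copied from the str() output above.
wine_sample <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Quality classes: bad (4 or lower), regular (5 or 6), good (7 or higher).
wine_sample$quality_class <- cut(wine_sample$quality,
                                 breaks = c(-Inf, 4, 6, Inf),
                                 labels = c("bad", "regular", "good"))

# Drop regular wines and the outliers named in the text.
good_bad <- subset(wine_sample,
                   quality_class != "regular" &
                     volatile.acidity < 1.5 &
                     citric.acid < 1 &
                     sulphates < 2)

print(good_bad)
```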
We analysed a red wine dataset with 1,599 observations and 12 features (excluding the index). One of these features is the quality score of the wine. The objective was to analyse the other features to understand their influence on wine quality. After studying the distribution of each feature, taking quality into account, we identified four features as the most influential: volatile acidity, citric acid, sulphates and alcohol. After grouping the qualities into three classes (bad, regular and good), we saw that quality was correlated with these main features. The correlation is positive in all cases except volatile acidity, whose correlation is negative. Multivariate analysis showed that combinations of the main features could help to determine different “spatial” regions for good wines and bad wines. We decided that predicting regular wines does not make much sense: most people want to detect a potentially good wine (or avoid a bad one).
According to our study, good wines seem to have lower volatile acidity, higher alcohol and medium-high sulphate values. Bad wines tend to have low values of citric acid, although, as we have seen, this feature does not improve our predictive models.
For the predictive model, we started with a simple linear model with only one main feature and then added the other three main features one by one. Although the R2 is small, the success rates are fairly high. This is mainly because the data are unbalanced: there are too many “regular” class observations.
In future work, we should try to improve our modelling procedure by balancing the data. We could also try an algorithm for feature selection. Other machine learning algorithms could work better for this problem.